A Practical Index for Text Retrieval Allowing Errors

Authors

  • Ricardo Baeza-Yates
  • Gonzalo Navarro
Abstract

We propose a text indexing technique for approximate pattern matching, which is practical and especially aimed at Information Retrieval (IR). Unlike other indices of this kind, it is able to retrieve any string that approximately matches a given search pattern. Every sequence of a fixed length appearing in the text is stored in the index, together with pointers to all the positions where it appears. The search pattern is cut into pieces so that at least one must match exactly. All the pieces are searched in the index and the union of candidate positions is verified. To reduce space requirements, pointers to blocks instead of exact positions can be used, which increases querying costs. We design an algorithm to optimize the pattern partition into pieces so that the total number of verifications is minimized. This also makes it possible to know in advance the expected cost of the search and the expected relevance of the query to the user. We show experimentally the build time, space requirements and query times of our index, finding that it is a practical alternative for IR.
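
The filtering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the equal-length split into k+1 pieces, the lookup of each piece by its first q characters, and the dynamic-programming verifier are all assumptions made for illustration. It stores exact positions rather than block pointers and does not include the optimized pattern partition.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Map every length-q substring of text to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def split_pattern(pattern, k):
    """Cut pattern into k+1 contiguous pieces; return (offset, piece) pairs.

    With at most k errors, at least one of the k+1 pieces must appear
    unaltered in any approximate occurrence.
    """
    m, parts = len(pattern), k + 1
    bounds = [m * j // parts for j in range(parts + 1)]
    return [(bounds[j], pattern[bounds[j]:bounds[j + 1]]) for j in range(parts)]


def match_ends(pattern, window, k):
    """Return end offsets in window where pattern matches with <= k errors.

    Column-wise edit-distance computation in which a match may start at
    any position of the window (top row fixed at zero).
    """
    m = len(pattern)
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(window, 1):
        prev_diag = col[0]
        col[0] = 0
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(prev_diag + (pattern[i - 1] != c),  # match or substitute
                         old + 1,                            # extra text character
                         col[i - 1] + 1)                     # missing pattern character
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


def search(text, index, q, pattern, k):
    """Return sorted text positions where an occurrence with <= k errors ends."""
    m = len(pattern)
    pieces = split_pattern(pattern, k)
    if any(len(piece) < q for _, piece in pieces):
        raise ValueError("pattern too short for this q and k")
    found = set()
    for offset, piece in pieces:
        # Assumption: look up the first q characters, then confirm the whole piece.
        for pos in index.get(piece[:q], ()):
            if not text.startswith(piece, pos):
                continue
            # Candidate area: where the whole pattern could lie around this piece.
            start = max(0, pos - offset - k)
            end = min(len(text), pos - offset + m + k)
            for e in match_ends(pattern, text[start:end], k):
                found.add(start + e)
    return sorted(found)
```

For example, with `text = "the quick brown fox"`, `index = build_qgram_index(text, 2)` and one allowed error, `search(text, index, 2, "quack", 1)` returns `[9]`, the position just past "quick". Only the piece "qu" is found in the index, so a single candidate area is verified.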

Similar resources

A Practical q -Gram Index for Text Retrieval Allowing Errors

We propose an indexing technique for approximate text searching, which is practical and powerful, and especially optimized for natural language text. Unlike other indices of this kind, it is able to retrieve any string that approximately matches the search pattern, not only words. Every text substring of a fixed length q is stored in the index, together with pointers to all the text positions whe...

Large Text Searching Allowing Errors

We present a full inverted index for exact and approximate string matching in large texts. The index is composed of a table containing the vocabulary of words of the text and a list of positions in the text corresponding to each word. The size of the table of words is usually much less than 1% of the text size and hence can be kept in main memory, where most query processing takes place. The te...

Fusion of Multiple Corrupted Transmissions and its effect on Information Retrieval

Much previous work has focused on correction of OCR degraded text with little work addressing the possibility of fusing the generated text from different OCR systems, which are assumed to produce different types of errors. This paper explores text fusion, which involves the use of language modeling to determine which OCR system (if any) properly recognized individual words. The technique was ap...

Automatic structuring of text files

In many practical information retrieval situations, it is necessary to process heterogeneous text databases that vary greatly in scope and coverage, and deal with many different subjects. In such an environment it is important to provide flexible access to individual text pieces, and to structure the collection so that related text elements are identified and appropriately linked. Metho...

RMIT University at TREC 2008: Legal Track

This paper reports on the participation of RMIT university in the 2008 TREC Legal Track Ad Hoc task. OCR errors can corrupt the document view formed by an information retrieval system, and substantially hinder the successful retrieval of relevant documents for user queries. In previous research, the presence of errors in OCR text was observed to lead to unstable and unpredictable retrieval effe...


Publication date: 1997